Abstract


Big data areas are expanding rapidly in terms of both workloads and runtime systems,
and this expansion poses a serious challenge to workload characterization, which is
the foundation of innovative system and architecture design. Previous major efforts
on big data benchmarking either propose a comprehensive but very large set of
workloads, or select only a few workloads according to so-called popularity, which
may lead to partial or even biased observations. In this paper, on the basis of a
comprehensive big data benchmark suite, BigDataBench, we reduce 77 workloads to 17
representative workloads from a micro-architectural perspective. On a typical
state-of-practice platform, the Intel Xeon E5645, we compare the representative big
data workloads with SPECINT, SPECFP, PARSEC, CloudSuite and HPCC. After a
comprehensive workload characterization, we make the following observations. First,
big data workloads are data-movement-dominated computing with more branch
operations, which together take up to 92% of the instruction mix; this places them in
a different class from desktop (SPEC CPU2006), CMP (PARSEC) and HPC (HPCC)
workloads. Second, corroborating previous work, Hadoop- and Spark-based big data
workloads have higher front-end stalls. Compared with traditional workloads, i.e.,
PARSEC, the big data workloads have larger instruction footprints. However, we also
note that, in addition to varied instruction-level parallelism, there are significant
disparities in front-end efficiency among different big data workloads. Third, we
find that complex software stacks that fail to use state-of-practice processors
efficiently are one of the main factors leading to high front-end stalls. For the
same workloads, the L1I cache miss rates differ by one order of magnitude among
implementations built on different software stacks.


1 Introduction


A consensus on big data system architecture is based on a shared-nothing hardware
design [221 E], in which nodes communicate with one another only by sending messages
via an interconnection network. Driven by cost-effectiveness, scale-out solutions,
which add more nodes to a system, are widely adopted to process an exploding amount
of data. Big data are partitioned across nodes, allowing multiple nodes to process
large data in parallel, and this partitioned data and execution gives partitioned
parallelism [7]. By nature, big data workloads are the scale-out workloads described
in [9].

Previous system and architecture work shows that different software stacks, e.g.,
MapReduce or Spark, have a significant impact on user-observed performance [26] and
micro-architectural characteristics [Uj. In addition to comprehensive workloads [23],
different software stacks should thus be included in the benchmarks [13]. However,
doing so multiplies the number of workloads, which aggravates both the cognitive
difficulty of workload characterization and the benchmarking cost, even though
characterization and benchmarking are the foundations of innovative system and
architecture design.

To date, previous major efforts on big data benchmarking either propose a
comprehensive but very large set of workloads (e.g., a recent comprehensive big data
benchmark suite, BigDataBench [23], available from [1], includes 77 workloads with
different implementations) or select only a few workloads according to so-called
popularity [9], which may result in partial or biased observations.

In this paper, we choose 45 metrics from micro-architectural aspects, including
instruction mix, cache and TLB behaviors, branch execution, pipeline behaviors,
off-core requests and snoop responses, parallelism, and operation intensity, for
workload characterization. On the basis of a comprehensive big data benchmark suite,
BigDataBench, we reduce 77 workloads to 17 representative ones. This reduction
preserves the comprehensiveness of the big data workloads while significantly
decreasing the benchmarking cost and the cognitive difficulty of workload
characterization. We release the workload characterization and reduction tool, named
WCRT, as an open-source project.

We compare the seventeen representative big data workloads with SPECINT, SPECFP,
PARSEC, HPCC, CloudSuite, and TPC-C on a system consisting of Intel Xeon E5645
processors. To investigate the impact of different software stacks, we also add six
workloads implemented with MPI (the same workloads as those included in the
representative big data workloads). After a comprehensive workload characterization,
we make the following observations.

First, for the first time, we reveal that big data workloads have more branch and
integer instructions, which places them in a different class from desktop (SPEC
CPU2006), CMP (PARSEC) and HPC (HPCC) workloads. Through analyzing the instruction
breakdown, we find that big data workloads are data-movement-dominated computing
with more branch operations, which together take up to 92% of the instruction mix.

Second, corroborating previous work, the Hadoop- and Spark-based big data workloads
have higher front-end stalls. Compared with traditional workloads, i.e., PARSEC, the
big data workloads have larger instruction footprints. However, we also note that, in
addition to varied instruction-level parallelism, there are significant disparities
in front-end efficiency among different subclasses of big data workloads. From this
angle, previous work such as CloudSuite covers only a part of the workloads included
in BigDataBench.

Third, we find that complex software stacks that fail to use state-of-practice
processors efficiently are one of the main factors leading to high front-end stalls.
For the same workloads, the L1I cache miss rates differ by one order of magnitude
among implementations built on different software stacks, a difference that is
overlooked in the previous work 0 na H. For the MPI versions of the big data
workloads, the L1I cache miss rates are very close to those of the traditional
benchmarks. In addition to innovative hardware design, we should pay great attention
to the co-design of software and hardware so as to use state-of-practice processors
efficiently.


2 Background 

2.1 BigDataBench 

BigDataBench is an open-source comprehensive big data benchmark suite. The current
version, BigDataBench 3.0, includes 77 workloads covering four types of applications
(cloud OLTP, OLAP, interactive analytics, and offline analytics) and three popular
Internet scenarios (search engine, social network and e-commerce). These workloads
cover both basic operations and state-of-the-art algorithms, and each operation or
algorithm has multiple implementations built upon mainstream software stacks such as
Hadoop and Spark. In short, BigDataBench aims to provide comprehensive workloads that
meet the needs of benchmark users from different research fields such as
architecture, system, and networking.

2.2 WCRT 

WCRT is a comprehensive workload characterization tool that can subset a whole
workload set by removing redundant workloads, which facilitates workload
characterization and other architecture research. It can also collect, analyze, and
visualize a large number of performance metrics. WCRT consists of two main modules:
profilers and a performance data analyzer. On each node, a profiler is deployed to
characterize the workloads running on it. The profiler collects the performance
metrics specified by users once a workload begins to run, and transfers the collected
data to the performance data analyzer when the workload completes. The analyzer is
deployed on a dedicated node that does not run other workloads. After collecting the
performance data from all profilers, the analyzer processes them using statistical
and visual functions. The statistical functions are used to normalize the performance
data and perform principal component analysis.


3 Representative Big Data Workloads


To reduce the cognitive difficulty of workload characterization and the benchmarking
cost, we use WCRT to reduce the number of workloads from the perspective of
micro-architecture. As the input to WCRT, we choose 45 micro-architecture-level
metrics, covering the characteristics of instruction mix, cache behavior, translation
look-aside buffer (TLB) behavior, branch execution, pipeline behavior, off-core
requests and snoop responses, parallelism, and operation intensity. Due to limited
space, we give the details of these 45 metrics on our web page, which is available
from [1]. We then normalize these metric values to a Gaussian distribution and use
Principal Component Analysis (PCA) to reduce the dimensions. Finally, we use K-means
to cluster the 77 workloads, which yields 17 clusters.
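
The normalize, PCA and K-means pipeline described above can be sketched as follows. This is a minimal illustration and not the WCRT implementation: it assumes NumPy is available, z-score normalization, a hypothetical 90% retained-variance threshold for PCA, a deterministic farthest-point initialization for K-means, and (as a further assumption) that the workload closest to each cluster centroid is taken as the representative. None of these choices is stated in the text.

```python
import numpy as np

def zscore(X):
    """Normalize each metric (column) to zero mean and unit variance."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # a constant metric carries no information; avoid 0/0
    return (X - mu) / sigma

def pca(X, var_kept=0.9):
    """Project X onto the fewest principal components keeping var_kept of the variance."""
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    n = int(np.searchsorted(explained, var_kept) + 1)  # assumed threshold, not from the paper
    return Xc @ Vt[:n].T

def kmeans(X, k, iters=100):
    """Lloyd's algorithm with deterministic farthest-point initialization (an assumption)."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                        for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return labels

def representatives(X, labels):
    """Per cluster, pick the workload closest to the centroid (hypothetical selection rule)."""
    reps = {}
    for j in np.unique(labels):
        idx = np.where(labels == j)[0]
        centroid = X[idx].mean(axis=0)
        reps[int(j)] = int(idx[np.argmin(((X[idx] - centroid) ** 2).sum(axis=1))])
    return reps
```

For the setting in this paper, `X` would be a 77 x 45 matrix (workloads by metrics) and `k` would be 17, e.g. `labels = kmeans(pca(zscore(X)), 17)`.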


3.1 Original data sets of representative workloads 

There are seven data sets for the representative workloads. As shown in Table 1, these
data sets have different types, sources and application domains. The original data
sets can be scaled with BDGS, the data generation tools provided by BigDataBench.
More details can be obtained from [1].


Table 1: The summary of data sets and data generation tools.

No. | Data set                    | Data set description                                            | Scalable data set
----+-----------------------------+-----------------------------------------------------------------+------------------------
1   | Wikipedia Entries           | 4,300,000 English articles                                      | Text Generator of BDGS
2   | Amazon Movie Reviews        | 7,911,684 reviews                                               | Text Generator of BDGS
3   | Google Web Graph            | 875713 nodes, 5105039 edges                                     | Graph Generator of BDGS
4   | Facebook Social Network     | 4039 nodes, 88234 edges                                         | Graph Generator of BDGS
5   | E-commerce Transaction Data | Table 1: 4 columns, 38658 rows. Table 2: 6 columns, 242735 rows | Table Generator of BDGS
6   | ProfSearch Person Resumes   | 278956 resumes                                                  | Table Generator of BDGS
7   | TPC-DS WebTable Data        | 26 tables                                                       | TPC DSGen


3.2 Behavior characteristics of representative workloads

The representative workloads are implemented with different approaches. Table 2
describes each representative big data workload from the perspectives of system
behaviors, data behaviors and application category.

3.2.1 System Behaviors 

Because big data workloads vary widely, we use system behaviors to classify and
characterize them. We choose CPU usage, disk I/O behavior and I/O bandwidth to
analyze the system behaviors of big data workloads. First, CPU usage is described by
the CPU utilization and the I/O wait ratio. CPU utilization is defined as the
percentage of time that the CPU spends executing at the system or user level, while
the I/O wait ratio is defined as the percentage of time that the CPU spends waiting
for outstanding disk I/O requests. Second, disk I/O performance is a key metric of
big data workloads. We investigate the disk I/O behavior with the average weighted
disk I/O time ratio. The weighted disk I/O time is defined as the number of I/Os in
progress times the number of milliseconds spent doing I/O since the last update, and
the average weighted disk I/O time ratio is the weighted disk I/O time divided by the
running time of the workload. Third, we choose the disk I/O bandwidth and the network
I/O bandwidth, which reflect the I/O throughput requirements of big data workloads.
Based on the above metrics, we roughly classify the workloads into three categories:
CPU-intensive workloads, which have high CPU utilization and a low average weighted
disk I/O time ratio or I/O bandwidth; I/O-intensive workloads, which have a high
average weighted disk I/O time ratio or I/O bandwidth but low CPU utilization; and
hybrid workloads, whose behaviors fall between those of CPU-intensive and
I/O-intensive workloads. In this paper, the rule for classifying big data workloads
is as follows: 1) if the CPU utilization of a workload is larger than 85%, we
consider it CPU-intensive; 2) if the average weighted disk I/O time ratio is larger
than 10 or the I/O wait ratio is larger than 20% and the CPU utilization is less than
60%, we consider it I/O-intensive; 3) all other workloads are considered hybrid
workloads.
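
The classification rule can be written down compactly. The sketch below is illustrative only: the function names are hypothetical, the weighted disk I/O time and the running time are assumed to be in the same unit (milliseconds), and rule 2 is read as "(weighted ratio > 10 or I/O wait > 20%) and CPU utilization < 60%", which is one possible reading of the sentence above.

```python
def weighted_io_time_ratio(weighted_io_ms, run_time_ms):
    """Average weighted disk I/O time ratio: weighted disk I/O time over workload run time.

    Both arguments are assumed to be in milliseconds, so the result is unitless.
    """
    return weighted_io_ms / run_time_ms

def classify_system_behavior(cpu_util, io_wait, weighted_io_ratio):
    """Classify a workload; cpu_util and io_wait are percentages in the range 0-100."""
    # Rule 1: high CPU utilization.
    if cpu_util > 85:
        return "CPU-intensive"
    # Rule 2 (assumed grouping): either I/O condition holds AND CPU utilization is low.
    if (weighted_io_ratio > 10 or io_wait > 20) and cpu_util < 60:
        return "I/O-intensive"
    # Rule 3: everything else.
    return "hybrid"
```

For instance, a workload with 40% CPU utilization and a 25% I/O wait ratio would be labeled I/O-intensive under this reading, while one with 70% CPU utilization and the same I/O wait ratio would be labeled hybrid.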

3.2.2 Data Behaviors 

For each workload, we characterize the data behaviors from the perspectives of data
schema and data processing behaviors, the latter measuring the ratios among input
data, output data and intermediate data. For the data schema, we describe the data
structure and semantic information of each workload. For the data processing
behaviors, we describe the ratio of output to input and the amount of intermediate
data. We use larger, less and equal to describe how the data volume changes. For
example, when the ratio of the data output to the data input is larger than or equal
to 0.9 and less than 1.1, we consider Output=Input; when the ratio is larger than or
equal to 0.01 and less than 0.9, we consider Output<Input; when the ratio is less
than 0.01, we consider Output<<Input; when the ratio is greater than or equal to 1.1,
we consider Output>Input. The rule is inspired by Luo et al. [6].
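
The thresholds above map directly onto a small helper. This is a sketch with a hypothetical function name; applying the same thresholds to intermediate data is an assumption, since the text states them only for output versus input.

```python
def compare_to_input(size, input_size):
    """Return '<<', '<', '=' or '>' for a data volume relative to the input volume."""
    ratio = size / input_size
    if ratio < 0.01:
        return "<<"   # less than 1% of the input
    if ratio < 0.9:
        return "<"    # at least 1% but below 90% of the input
    if ratio < 1.1:
        return "="    # within [0.9, 1.1) of the input
    return ">"        # at least 110% of the input

def describe(output_size, intermediate_size, input_size):
    """Render a data processing behavior string in the style used for Table 2."""
    out = "Output " + compare_to_input(output_size, input_size) + " Input"
    if intermediate_size == 0:
        return out + " and no Intermediate"
    # Assumption: intermediate data is compared to the input with the same thresholds.
    return out + " and Intermediate" + compare_to_input(intermediate_size, input_size) + "Input"
```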

3.2.3 Application category 

We consider three application categories: data analysis workloads, service workloads 
and interactive analysis workloads. 


4 Experimental Configurations and Methodology 

This section presents the experiment configurations and methodology.

4.1 Experiment Configurations 

To obtain insights into the system and architecture for big data, we run a series of 
experiments using the seventeen representative workloads. 

Jia et al. na found that software stacks have a serious impact on big data workloads
in terms of micro-architectural characteristics. Compared to traditional software
stacks, big data software stacks, e.g., Hadoop or Spark, usually have more complex
structures, which let programmers write less code to achieve their intended goals.
The upshot is two-fold. On one hand, the ratio of executed system software and
middleware instructions to user application instructions tends to be large, which
makes their impact on system behavior large as well. On the other hand, they have
larger instruction footprints. For example, the previous work iiaizii reported
higher L1I cache miss rates for Hadoop and Spark workloads. For comparison, in
addition to the BigDataBench subset, we also add MPI implementations of six data
analysis workloads: Bayes, K-means, PageRank, Grep, WordCount and Sort. We choose MPI
for two reasons. First, MPI can implement all of the operator primitives of Hadoop or
Spark, although it requires more sophisticated programming skills. Second, compared
with Hadoop and Spark, the MPI stack is much thinner.

For the same big data application, the scale of the system running it is mainly
decided by the size of the input data. For the experiments in this paper, the input
data is about 128 GB, except for the PageRank workload, whose input is measured in
terms of the number of vertices. We deploy the big data workloads on a system with a
matching scale of 5 nodes. On our testbed, each node has one Xeon E5645 processor
equipped with 32 GB of memory and an 8 TB disk, as listed in Table 3. In the rest of
the experiments, hyperthreading is disabled on our testbed because enabling this
feature makes it more complex to measure and interpret performance data m-. The
operating system is CentOS 6.4 with Linux kernel 3.10.11. The Hadoop and JDK versions
are 1.0.2 and 1.6, respectively. The Spark version is 1.0.2.

Table 2: Details of the representative big data workloads.

ID | Workload (abbreviation) (number of workloads represented) | Description of the representative workload | Data description
---+-----------------------------------------------------------+--------------------------------------------+-----------------
1  | H-Read | Basic operation of reading in HBase, which is a popular non-relational, distributed database. | ProfSearch data set, each record is 1128 bytes K-V text file
2  | Hive-Difference (H-Difference) (9) | Hive implementation of set difference, one of the five basic operators from relational algebra. | E-commerce Transaction data set, each record is 52 bytes K-V text file
3  | Impala-SelectQuery (I-SelectQuery) (9) | Impala implementation of select query to filter data; filter is one of the five basic operators from relational algebra. | E-commerce Transaction data set, each record is 52 bytes K-V text file
4  | Hive-TPC-DS-Q3 (H-TPC-DS-query3) (9) | Hive implementation of query 3 of TPC-DS, a popular decision support benchmark proposed by the Transaction Processing Performance Council; complex relational algebra. | TPC-DS Web data set, each record is 14 KB K-V text file
5  | Spark-WordCount (S-WordCount) (8) | Spark implementation of word counting, which counts the occurrences of each word in the input file. Counting is a fundamental operation for big data statistics analytics. | Wikipedia data set, each record is 64KB K-V text file
6  | Impala-OrderBy (I-OrderBy) (7) | Impala implementation of sorting, a fundamental operation from relational algebra that is used extensively in various scenarios. | E-commerce Transaction data set, each record is 52 bytes K-V text file
7  | Hadoop-Grep (H-Grep) (7) | Searching plain text files for lines that match a regular expression with Hadoop MapReduce. Searching is another fundamental and widely used operation. | Wikipedia data set, each record is 64KB K-V text file
8  | Shark-TPC-DS-Q10 (S-TPC-DS-query10) (4) | Shark implementation of query 10 of TPC-DS; complex relational algebra. | TPC-DS Web data set, each record is 14 KB K-V text file
9  | Shark-Project (S-Project) (4) | Shark implementation of project, one of the five basic operators from relational algebra. | E-commerce Transaction data set, each record is 52 bytes K-V text file
10 | Shark-OrderBy (S-OrderBy) (3) | Shark implementation of sorting. | E-commerce Transaction data set, each record is 52 bytes K-V text file
11 | Spark-Kmeans (S-Kmeans) (1) | Spark implementation of k-means, a popular clustering algorithm from discrete mathematics for partitioning n observations into k clusters. | Facebook data set, each record is 94 bytes K-V text file
12 | Shark-TPC-DS-Q8 (S-TPC-DS-query8) (1) | Shark implementation of query 8 of TPC-DS. | TPC-DS Web data set, each record is 14 KB K-V text file
13 | Spark-PageRank (S-PageRank) (1) | Spark implementation of PageRank, a graph computing algorithm used by Google to score the importance of a web page by counting the number and quality of links to the page. | Google data set, each record is 6KB K-V text file
14 | Spark-Grep (S-Grep) (1) | Spark implementation of Grep. | Wikipedia data set, each record is 64KB K-V text file
15 | Hadoop-WordCount (H-WordCount) (1) | - | Wikipedia data set, each record is 64KB K-V text file
16 | Hadoop-NaiveBayes (H-NaiveBayes) (1) | Hadoop implementation of naive Bayes, a simple but widely used probabilistic classifier in statistical calculation. | Amazon data set, each record is 52KB K-V text file
17 | Spark-Sort (S-Sort) (1) | Spark implementation of sorting. | Wikipedia data set, each record is 64KB K-V text file

Table 2 (continued).

ID | Abbreviation      | Application category | Data processing behaviors                  | System behaviors
---+-------------------+----------------------+--------------------------------------------+-----------------
1  | H-Read            | service              | Output = Input and no Intermediate         | I/O-Intensive
2  | H-Difference      | interactive analysis | Output < Input and Intermediate<Input      | I/O-Intensive
3  | I-SelectQuery     | interactive analysis | Output < Input and no Intermediate<<Input  | I/O-Intensive
4  | H-TPC-DS-query3   | interactive analysis | Output = Input and no Intermediate         | Hybrid
5  | S-WordCount       | data analysis        | Output << Input and Intermediate<Input     | I/O-Intensive
6  | I-OrderBy         | interactive analysis | Output = Input and Intermediate=Input      | Hybrid
7  | H-Grep            | data analysis        | Output << Input and Intermediate<<Input    | CPU-Intensive
8  | S-TPC-DS-query10  | interactive analysis | Output << Input and no Intermediate        | Hybrid
9  | S-Project         | interactive analysis | Output < Input and no Intermediate         | I/O-Intensive
10 | S-OrderBy         | interactive analysis | Output = Input and Intermediate=Input      | I/O-Intensive
11 | S-Kmeans          | data analysis        | Output = Input and Intermediate=Input      | CPU-Intensive
12 | S-TPC-DS-query8   | interactive analysis | Output << Input and no Intermediate        | Hybrid
13 | S-PageRank        | data analysis        | Output > Input and Intermediate>Input      | CPU-Intensive
14 | S-Grep            | data analysis        | Output << Input and Intermediate<<Input    | I/O-Intensive
15 | H-WordCount       | data analysis        | Output << Input and Intermediate<<Input    | CPU-Intensive
16 | H-NaiveBayes      | data analysis        | Output << Input and Intermediate<<Input    | CPU-Intensive
17 | S-Sort            | data analysis        | Output = Input and Intermediate=Input      | Hybrid

Note: The number of workloads that each selected workload can represent is given in
parentheses after its abbreviation.

Table 3: Node configuration details of Xeon E5645

CPU type        | Intel(R) Xeon E5645
Number of cores | 6 cores @ 2.40 GHz
L1 DCache       | 6 x 32 KB
L1 ICache       | 6 x 32 KB
L2 Cache        | 6 x 256 KB
L3 Cache        | 12 MB


The HBase, Hive and MPICH2 versions are 0.94.5, 0.9 and 1.5, respectively. Table 2
shows the workload summary.

4.2 Experiment Methodology 

Intel Xeon processors provide hardware performance counters to support
micro-architecture-level profiling. We use Perf, a Linux profiling tool, to collect
dozens of events, whose event numbers and unit masks can be found in the Intel
Developer's Manual. In addition, we access the proc file system to collect OS-level
performance data. We collect performance data after a ramp-up period of about 30
seconds.
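
As an illustration of the OS-level part of this methodology, the sketch below derives the CPU utilization and I/O wait ratio from two snapshots of the aggregate `cpu` line in `/proc/stat`, whose counters are ordered user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice. It is not the authors' collection script: the function names are hypothetical, and counting user + nice + system as "executing at the system or user level" is an assumption.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counters."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(v) for v in fields[1:]]

def cpu_ratios(before, after):
    """Return (CPU utilization %, I/O wait %) between two /proc/stat 'cpu' lines."""
    b, a = parse_cpu_line(before), parse_cpu_line(after)
    delta = [y - x for x, y in zip(b, a)]
    # Sum only the first eight counters: guest time is already included in user time.
    total = sum(delta[:8])
    user, nice, system, _idle, iowait = delta[:5]
    busy = user + nice + system  # assumed definition of user- plus system-level time
    return 100.0 * busy / total, 100.0 * iowait / total

def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as f:
        return f.readline()
```

In use, one snapshot would be taken with `read_cpu_line()` once the ramp-up period has elapsed and a second one when the workload completes, and both would be passed to `cpu_ratios`.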

4.3 The Other Benchmarks Setup 

For SPEC CPU2006, we ran the official applications with the first reference input and
separated the average results into two groups: integer benchmarks (SPECINT) and
floating-point benchmarks (SPECFP). We used HPCC 1.4, a representative HPC benchmark
suite, and ran all seven of its benchmarks. PARSEC is a benchmark suite composed of
multi-threaded programs; we deployed the PARSEC 3.0 Beta Release, ran all 12
benchmarks with the native input data sets, and used GCC 4.1.2 for compiling.
CloudSuite is a benchmark suite composed of cloud scale-out workloads; we deployed
the CloudSuite 1.0 Release and ran all six benchmarks with input data that
corresponds to our benchmark data sets. TPC-C is an online transaction processing
(OLTP) benchmark, and our deployment is tpcc-uva v1.2.


5 Experiment Results and Observations 

In this section, we report the micro-architectural behaviors in terms of instruction
mix, pipeline efficiency and cache efficiency. To further understand these
behaviors, we investigate the footprint of big data workloads and the impact of
software stacks on them. For clarity, in addition to the average behaviors, we also
report the behaviors of the three subclasses of big data workloads classified in
Table 2. We compare the representative big data workloads with PARSEC, SPECINT,
SPECFP, HPCC, CloudSuite and TPC-C.

Figure 1: The instruction breakdown of different workloads on X86 platforms.


Section 5.1 introduces the instruction mix; Section 5.2 introduces instruction-level
parallelism; Section 5.3 introduces the cache behaviors; Section 5.4 introduces
locality; Section 5.5 introduces the impact of software stacks on big data workloads;
Section 5.6 is the summary.

5.1 Instruction Mix 

To reveal the instruction behaviors of big data workloads, we choose the instruction
mix as the metric. Figure 1 shows the retired instruction breakdown, from which we
make two observations about big data workloads.

First, big data workloads have more branch instructions: the average branch
instruction percentage of big data workloads is 18.7%, which is distinctly larger
than those of HPCC, PARSEC, SPECFP and SPECINT. Furthermore, within the big data
workloads, from the application category dimension, the average branch ratios of the
service, data analysis and interactive analysis workloads are 18%, 19% and 19%,
respectively; from the system behavior dimension, the average branch ratios of the
CPU-intensive, I/O-intensive and hybrid workloads are 19%, 18% and 19%, respectively.

The above observations can be explained as follows. First, big data analysis
workloads (including data analysis and interactive analysis), as shown in Table 2,
are based on discrete mathematics (such as graph computation), relational algebra and
mathematical statistics (such as classification). This is different from the
traditional numerical computation, such as solving linear or differential equations,
in the scientific computing community. Traditional scientific computing code has
larger basic blocks because of complex formula calculations [19], whereas big data
analysis kernel code is biased toward simple operations and conditional judgments.
For example, Algorithm 1 shows the kernel pseudocode of Kmeans, which includes many
judgments in the main loop (lines 4 to 10), and the ComputeDist function block in
Algorithm 1 is simple, containing only 40 lines in the real application. Second,
service workloads (cloud OLTP) always need to conduct different processing steps to
deal with diverse user requests, which results in many conditional judgment
operations; the switch-case style is adopted frequently. This is similar to
traditional service workloads such as TPC-C, which also has a very high branch
instruction ratio (30%).


Algorithm 1 Kmeans()

Input:  Global variable centers;
        The offset key;
        The sample value.
Output: <key', value'> pair, where key' is the index of the closest center point
        and value' is a string comprising the sample information.

 1. Construct the sample instance from value;
 2. minDis = Double.MAX_VALUE;
 3. index = -1;
 4. for i = 0 to centers.length do
 5.     dis = ComputeDist(instance, centers[i]);
 6.     if dis < minDis then
 7.         minDis = dis;
 8.         index = i;
 9.     end if
10. end for
11. Take index as key';
12. Construct value' as a string comprising the values of different dimensions;
13. return <key', value'> pair;
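
For readers who prefer runnable code, Algorithm 1 can be transcribed roughly as below. This is a sketch, not the BigDataBench source: it assumes that ComputeDist is the Euclidean distance and that a sample value is a whitespace-separated string of numbers, neither of which is specified by the pseudocode.

```python
import math

def compute_dist(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans_map(key, value, centers):
    """Assign one sample to its closest center, following steps 1-13 of Algorithm 1.

    key is the input offset (unused, as in the pseudocode); value is the sample,
    assumed here to be whitespace-separated numbers.
    """
    instance = [float(v) for v in value.split()]     # step 1
    min_dis = float("inf")                           # step 2
    index = -1                                       # step 3
    for i, center in enumerate(centers):             # steps 4-10
        dis = compute_dist(instance, center)
        if dis < min_dis:                            # the data-dependent branch
            min_dis = dis
            index = i
    out_value = " ".join(str(v) for v in instance)   # step 12
    return index, out_value                          # steps 11 and 13
```

The loop body is a short distance computation followed by a data-dependent comparison, which is the kind of small basic block with frequent branching that the discussion above refers to.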


Furthermore, we profile the branch mis-prediction ratios on two X86 platforms: the
Intel Xeon E5645 and the Intel Atom D510. The Atom D510 is a low-power processor with
simple branch predictors, whereas the Xeon E5645 is a server processor with
sophisticated predictors. Our results show that the average branch mis-prediction
ratio on the Atom D510 is 7.8%, whereas that on the Xeon E5645 is only 2.8%. As shown
in Table 4, the E5645 has more sophisticated branch prediction mechanisms, e.g., a
loop counter and a predictor for indirect jumps and calls. Furthermore, it is
equipped with more BTB (Branch Target Buffer) entries.

Table 4: The summary of branch prediction mechanisms.

Component                | D510                                                     | E5645
-------------------------+----------------------------------------------------------+--------------------------------------------------------------------
Conditional jumps        | two-level adaptive predictor with a global history table | hybrid predictor combining a two-level predictor and a loop counter
Indirect jumps and calls | Not                                                      | two-level predictor
BTB entries              | 128                                                      | 8192
Misprediction penalty    | 15 cycles                                                | 11-13 cycles
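
To give a feel for what these numbers mean, the following back-of-the-envelope sketch estimates the cycles per instruction lost to branch mis-predictions as branch ratio x mis-prediction ratio x penalty. It is only a rough first-order model under stated assumptions: the mis-prediction ratio is taken to be per executed branch, the 18.7% branch ratio measured for the big data workloads is assumed to hold on both processors, and overlap with other stalls is ignored.

```python
def mispredict_cpi(branch_ratio, mispredict_ratio, penalty_cycles):
    """First-order estimate of cycles per instruction lost to branch mis-predictions."""
    return branch_ratio * mispredict_ratio * penalty_cycles

BRANCH_RATIO = 0.187  # average branch instruction ratio of the big data workloads

# Atom D510: 7.8% mis-prediction ratio, 15-cycle penalty (Table 4).
atom = mispredict_cpi(BRANCH_RATIO, 0.078, 15)

# Xeon E5645: 2.8% mis-prediction ratio, 11-13 cycle penalty (Table 4).
xeon_low = mispredict_cpi(BRANCH_RATIO, 0.028, 11)
xeon_high = mispredict_cpi(BRANCH_RATIO, 0.028, 13)

print(f"Atom D510:  ~{atom:.3f} cycles per instruction")
print(f"Xeon E5645: ~{xeon_low:.3f} to {xeon_high:.3f} cycles per instruction")
```

Under these assumptions the estimated loss is roughly 0.22 cycles per instruction on the D510 versus about 0.06 to 0.07 on the E5645.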


Second, big data workloads have more integer instructions. On the Intel Xeon E5645,
the average integer instruction ratio is 38%, which is much higher than those of the
HPCC, PARSEC and SPECFP workloads. This value is close to those of the CloudSuite
(34%), SPECINT (41%) and TPC-C (33%) workloads, which are also integer-dominated.
Furthermore, within the big data workloads, from the application category dimension,
the average integer instruction ratios of the service, data analysis and interactive
analysis workloads are 40%, 38% and 38%, respectively; from the system behavior
dimension, the average integer instruction ratios of the CPU-intensive,
I/O-intensive and hybrid workloads are 37%, 39% and 38%, respectively.

The above observations can be explained as follows. First, many big data workloads
are not floating-point dominated (e.g., Sort, Grep, WordCount, and most of the Query
and cloud OLTP workloads). Second, the floating-point-dominated workloads such as
Bayes, Kmeans and PageRank need to perform a massive number of operations, for
example address calculations and branch calculations, before they perform the
floating-point operations, and all of these are integer operations.

Furthermore, we analyze the integer instruction breakdown of big data algorithms by
inserting analysis code into the source code. We classify all integer operations into
three classes. The first class is integer address calculation, such as locating a
position in an integer array; the second class is floating-point address calculation,
such as locating a position in a floating-point array; the third class is other
calculations, such as computations or branch calculations. Figure 2 shows the average
integer instruction breakdown in big data workloads: 64% is integer address
calculation, 18% is floating-point address calculation and 18% is other calculations.

Figure 2: The integer instruction breakdown on X86 platforms.

Combining the results of Figure 1 and Figure 2, the average ratio of load/store and
address-calculation-related instructions is roughly 73% for big data workloads. These
instructions are all related to data movement. The ratio increases to 92% when the
branch instructions are also considered. Hence it is reasonable to say that big data
workloads are data-movement-dominated computing with more branch operations.
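
The arithmetic behind the 73% and 92% figures can be checked with the averages reported in this section. The snippet below is only a consistency check; the load/store share it derives is inferred from the reported numbers and is not stated in the text.

```python
integer_ratio = 0.38    # average integer instruction ratio
int_addr_share = 0.64   # share of integer instructions doing integer address calculation
fp_addr_share = 0.18    # share of integer instructions doing floating-point address calculation
branch_ratio = 0.187    # average branch instruction ratio
data_movement = 0.73    # load/store plus address calculation, as reported

# Address calculation as a fraction of all instructions.
address_calc = integer_ratio * (int_addr_share + fp_addr_share)

# Inferred (not reported): the load/store share implied by the 73% figure.
implied_load_store = data_movement - address_calc

# Data movement plus branches.
with_branches = data_movement + branch_ratio

print(f"address calculation: {address_calc:.1%}")
print(f"implied load/store:  {implied_load_store:.1%}")
print(f"with branches:       {with_branches:.1%}")
```

Address calculation thus accounts for about 31% of all instructions, which leaves roughly 42% for loads and stores if the 73% total is taken at face value, and adding the 18.7% branch ratio gives about 92%.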

In this subsection, we only investigate the instruction mix on the Intel Xeon E5645,
which is a typical X86 processor. Blem et al. [S] have shown that whether an ISA is
RISC or CISC is irrelevant to modern microprocessor performance, and that the
difference in instruction mix between different ISAs is indistinguishable.

[Implications].

A high branch instruction ratio indicates that sophisticated branch prediction
mechanisms should be used for big data workloads to reduce branch mis-predictions,
each of which makes the pipeline flush all the wrongly fetched instructions and fetch
the correct ones, incurring a high penalty.

The large percentage of integer instructions implies that the floating-point units in
the processor should be sized appropriately to match the floating-point performance
that is actually needed. For example, the E5645 processor can achieve 57.6 GFLOPS in
theory, but the average floating-point performance of big data workloads is about 0.1
GFLOPS. Furthermore, the latest Xeon processors, such as the dual Xeon E5 2697, can
achieve 345 GFLOPS, which incurs a serious waste of floating-point capacity and hence
die size.
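
The gap between peak and achieved floating-point performance can be made concrete with a short calculation. The sketch assumes 4 double-precision floating-point operations per core per cycle, which is the value that reproduces the quoted 57.6 GFLOPS for 6 cores at 2.40 GHz; this per-cycle figure is an assumption and is not stated in the text.

```python
def peak_gflops(cores, ghz, flops_per_cycle):
    """Theoretical peak in GFLOPS: cores x clock (GHz) x FLOPs per core per cycle."""
    return cores * ghz * flops_per_cycle

# Xeon E5645: 6 cores at 2.40 GHz (Table 3); 4 FLOPs/cycle/core is an assumption.
peak = peak_gflops(cores=6, ghz=2.4, flops_per_cycle=4)

achieved = 0.1                 # average GFLOPS of the big data workloads, as reported
utilization = achieved / peak  # fraction of the floating-point peak actually used

print(f"peak: {peak:.1f} GFLOPS, utilization: {utilization:.2%}")
```

With these inputs the big data workloads use well under 1% of the theoretical floating-point peak of the E5645.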

5.2 ILP 

As shown in Figure 3, the average IPC of the big data workloads is 1.28. This is
larger than those of SPECFP (1.1) and SPECINT (0.9), the same as that of PARSEC
(1.28), and slightly smaller than that of HPCC (1.5). This implies that the
instruction-level parallelism of big data workloads is not considerably different
from that of other traditional workloads.

Furthermore, within the big data workloads, from the application category dimension,
the average IPCs of the service, data analysis and interactive analysis workloads are
0.8, 1.2 and 1.3, respectively; from the system behavior dimension, the average IPCs
of the CPU-intensive, I/O-intensive and hybrid workloads are 1.3, 1.2 and 1.3,
respectively. So ILP varies across application categories. There are significant
disparities in IPC among different big data workloads, and the service workloads have
lower IPC. This result corroborates the report in [9], which found that most of the
service workloads in CloudSuite have low IPC (0.9 on average). Similar to CloudSuite,
we also notice that H-Read (0.8), the only service workload among the 17 workloads,
has low IPC. However, a large percentage of the big data workloads (in the
BigDataBench subset) have higher IPC than the SPECFP average, and several workloads
have quite high IPC, for example S-Project (1.6) and S-TPC-DS-query8 (1.7). These
observations make two points: first, CloudSuite covers only a part of the workloads
included in BigDataBench in terms of IPC; second, there are significant disparities
in IPC among different big data workloads.

Figure 3: The IPC of different workloads on the X86 platform.

[Implications]. Architecture communities are exploring different technology
roadmaps for big data workloads: some focus on scale-out wimpy cores (e.g., in-order
cores), for example, HP’s Moonshot uses Intel Atom processors [2] and Facebook is
interested in ARM processors [12]; other internet service providers try to use brawny
cores or even accelerators, e.g., GPGPUs, for CPU-intensive computing like deep
learning. The work in [9] advocates the use of a modest degree of superscalar
out-of-order execution. The ILP analysis of the big data workloads shows that there
are different subclasses of big data workloads, which is also confirmed by our
analysis of cache behaviors in Section 5.3. We speculate that the processor
architecture should not have a one-size-fits-all solution.

5.3 Cache Behaviors 

L1I Cache Behaviors: As the front-end pipeline efficiencies are closely related
to the L1I cache behaviors, we evaluate the front-end pipeline efficiencies by
investigating the L1I cache MPKI (misses per kilo instructions). In Figure 4,
the average L1I MPKI of the big data workloads is 15. Corroborating previous
work iiniBl. the average L1I MPKI of the big data workloads is larger than those
of SPECFP, SPECINT, HPCC and PARSEC, but lower than that of CloudSuite (whose
average L1I MPKI is as large as 32).
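
For readers less familiar with these metrics, both MPKI and IPC are simple ratios of
hardware counter values. The sketch below shows the arithmetic; the function names
and the counter values are hypothetical, chosen only so that the results match the
averages quoted above (L1I MPKI of 15 and IPC of 1.28).

```python
def mpki(misses: int, instructions: int) -> float:
    """Misses per kilo instructions: misses per 1000 retired instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return 1000.0 * misses / instructions


def ipc(instructions: int, cycles: int) -> float:
    """Instructions retired per core cycle."""
    if cycles <= 0:
        raise ValueError("cycles must be positive")
    return instructions / cycles


# Hypothetical counter readings for one run (not measured data).
retired_instructions = 2_000_000_000
l1i_misses = 30_000_000
core_cycles = 1_562_500_000

print(f"L1I MPKI = {mpki(l1i_misses, retired_instructions):.1f}")
print(f"IPC      = {ipc(retired_instructions, core_cycles):.2f}")
```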

Furthermore, in big data workloads, from the application category dimension, the
average L1I MPKI of the service, data analysis and interactive analysis workloads
are 51, 13 and 14 respectively; from the system behavior dimension, the average L1I
MPKI of the CPU-intensive, I/O-intensive, and hybrid workloads are 8, 22, and 9,
respectively. There are significant disparities of L1I cache MPKI among different big
data workloads. Among those workloads, H-Read has the highest L1I MPKI (51). The
reason for the high L1I MPKI of H-Read is that, as a service workload, its user
requests are more stochastic, and hence its instruction execution is more stochastic
than that of other workloads, so its instruction footprint should be larger. We also
investigated other service workloads. Corroborating the report in [9], we found that
most of them have high L1I MPKI. Examples are TPC-C, and Streaming, Olio, Cloud9 and
Search in CloudSuite. Our system behavior analysis in Section 3 shows that most
of the CloudSuite workloads would be classified as I/O-intensive workloads according
to our rule, and hence higher L1I MPKI is reported for CloudSuite. These
observations again indicate that CloudSuite only covers a part of the workloads
included in the BigDataBench subset.

Figure 4: The Cache behaviors of different workloads on the X86 platform.

L2 Cache and LLC Behaviors: As the L1D cache miss penalty can be hidden by a
modern out-of-order pipeline, we evaluate the data-access efficiencies by
investigating the L2 and L3 cache MPKI.

In Figure 4, the average L2 MPKI of the big data workloads is 11. This is larger
than those of HPCC and PARSEC, and smaller than those of CloudSuite, TPC-C, SPECFP
and SPECINT. Furthermore, in big data workloads, from the application category
dimension, the average L2 MPKI of the service, data analysis and interactive
analysis workloads are 32, 11 and 8 respectively; from the system behavior
dimension, the average L2 MPKI of the CPU-intensive, I/O-intensive, and hybrid
workloads are 6.8, 11.6, and 7.7 respectively. There are disparities of L2 cache
MPKI among different big data workloads: the I/O-intensive workloads and the service
workloads have higher L2 MPKI.

In Figure 4, the average L3 MPKI of the big data workloads is 1.2. This is smaller
than those of all the other workloads. Furthermore, in big data workloads, from the
application category dimension, the average L3 MPKI of the service, data analysis
and interactive analysis workloads are 1.2, 1.7 and 0.8 respectively; from the
system behavior dimension, the average L3 MPKI of the CPU-intensive, I/O-intensive,
and hybrid workloads are 1.7, 1.2, and 0.9 respectively. There are disparities of
L3 cache MPKI among different big data workloads: the CPU-intensive workloads and
the data analysis workloads have higher L3 MPKI.

Figure 5: The TLB behaviors of different workloads on X86 platforms.

TLB Behaviors: In Figure 5, the average ITLB MPKI of the big data workloads
is 0.05. This is larger than those of HPCC, PARSEC, TPC-C, SPECFP and SPECINT, and
smaller than that of CloudSuite. Furthermore, from the application category
dimension, the average ITLB MPKI of the service, data analysis and interactive
analysis workloads are 0.2, 0.04 and 0.04 respectively; from the system behavior
dimension, the average ITLB MPKI of the CPU-intensive, I/O-intensive, and hybrid
workloads are 0.03, 0.08, and 0.05, respectively. The service and I/O-intensive
workloads have higher ITLB MPKI.

The average DTLB MPKI of the big data workloads is 0.9. This is close to those of
HPCC and PARSEC, and smaller than those of CloudSuite, TPC-C, SPECFP and SPECINT.
Furthermore, from the application category dimension, the average DTLB MPKI of the
service, data analysis and interactive analysis workloads are 1.8, 1.1 and 0.5
respectively; from the system behavior dimension, the average DTLB MPKI of the
CPU-intensive, I/O-intensive, and hybrid workloads are 1.3, 0.7, and 0.6,
respectively. The CPU-intensive and service workloads have higher DTLB MPKI.

[Implications]. Our results show that, in addition to varied instruction-level
parallelism, there are significant disparities of front-end efficiencies among
different big data workloads compared with CloudSuite. These observations indicate
that CloudSuite only covers a part of the workloads included in BigDataBench.

5.4 Locality 

As the cache behavior results on the E5645 are interesting, for example the higher
average L1I MPKI, we investigate them further. A workload’s instruction or data
footprint can reflect the workload’s locality, which is mainly related to the cache
miss ratios of the workload on a specific processor. In this subsection, we use a
simulator, MARSSx86, to evaluate the instruction and data footprints of the
representative big data workloads. As micro-architectural simulation is very
time-consuming, we only choose the Hadoop workloads as the case study.

5.4.1 Simulator Configurations 

The locality evaluation methods are similar to [1] and [3]. The temporal locality of
a program can be estimated by analyzing how its miss rate curve changes with the
cache capacity. In this paper, we use MARSSx86 as the simulator and vary the L1D
and L1I cache sizes for evaluation.

System Configurations. First, we set the processor architecture to an Atom-like
in-order pipeline with a single core. Second, we set up a two-level cache hierarchy
with L1D, L1I and L2 caches: 8-way associative L1 caches with 64-byte lines and a
shared 8-way associative L2 cache with 64-byte lines. Third, we vary the L1 cache
size from 16 KB to 8192 KB and record the cache miss ratio of each run.
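
The cache-size sweep described above can be illustrated with a toy model. The sketch
below is not MARSSx86: it is a minimal set-associative cache simulator that uses the
line size and associativity given above and sweeps the capacity from 16 KB to
8192 KB. The LRU replacement policy, the function names, and the synthetic address
trace (a loop over a 256 KB region) are assumptions made for illustration.

```python
from collections import OrderedDict

LINE_BYTES = 64  # line size from the configuration above
WAYS = 8         # associativity from the configuration above


def miss_ratio(addresses, cache_bytes, line_bytes=LINE_BYTES, ways=WAYS):
    """Miss ratio of a set-associative cache with LRU replacement (assumed)."""
    num_sets = cache_bytes // (line_bytes * ways)
    if num_sets < 1:
        raise ValueError("cache too small for this line size and associativity")
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    total = 0
    for addr in addresses:
        total += 1
        line = addr // line_bytes
        lru = sets[line % num_sets]
        if line in lru:
            lru.move_to_end(line)        # hit: mark as most recently used
        else:
            misses += 1
            if len(lru) >= ways:
                lru.popitem(last=False)  # evict the least recently used line
            lru[line] = True
    return misses / total if total else 0.0


def miss_ratio_curve(addresses, sizes_kb):
    """Map each cache size in KB to the miss ratio observed on the trace."""
    addresses = list(addresses)
    return {kb: miss_ratio(addresses, kb * 1024) for kb in sizes_kb}


# Hypothetical trace: four passes over a 256 KB region, one access per line.
footprint_bytes = 256 * 1024
trace = [a for _ in range(4) for a in range(0, footprint_bytes, LINE_BYTES)]

sizes_kb = [16 << i for i in range(10)]  # 16, 32, ..., 8192 KB
curve = miss_ratio_curve(trace, sizes_kb)
for kb in sizes_kb:
    print(f"{kb:5d} KB: miss ratio {curve[kb]:.2f}")
```

On this synthetic trace the miss ratio stays at 1.0 until the cache holds the whole
256 KB loop and then drops to the cold-miss floor of 0.25; the capacity at which the
curve flattens is what the text refers to as the footprint.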

Workloads Configurations. For the Hadoop workloads, we choose an input data size
of 64 MB and set one map slot and one reduce slot. We use the simsmall input set to
drive PARSEC in the simulator.

Running Configurations. First, we perform functional simulation with QEMU to skip
the first billion instructions, which belong to the preparation stage of Hadoop;
then we switch to detailed simulation with MARSSx86. We choose five segments of each
Hadoop workload: Map 0% to 1%, Map 50% to 51%, Map 99% to 100%, Reduce 0% to 1% and
Reduce 99% to 100%. The cache miss ratio of a Hadoop workload is the weighted mean
of the five segments.
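
The weighted mean over the five segments can be sketched as follows. The text does
not state which weights are used, so both the weights and the per-segment miss
ratios below are hypothetical placeholders, not measured values; only the segment
names come from the description above.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean of `values` with non-negative `weights`."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Segment name -> (hypothetical miss ratio, hypothetical weight).
segments = {
    "Map 0%-1%":       (0.04, 1.0),
    "Map 50%-51%":     (0.02, 2.0),
    "Map 99%-100%":    (0.03, 1.0),
    "Reduce 0%-1%":    (0.05, 1.0),
    "Reduce 99%-100%": (0.01, 1.0),
}

ratios = [ratio for ratio, _ in segments.values()]
weights = [weight for _, weight in segments.values()]
workload_miss_ratio = weighted_mean(ratios, weights)
print(f"weighted miss ratio = {workload_miss_ratio:.4f}")
```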

5.4.2 Experiment Results

Figure 6 reports the average instruction cache miss ratios versus cache size for the
representative big data workloads and PARSEC. From Figure 6, we find that the
instruction cache miss ratios of the Hadoop workloads are distinctly larger than
those of the PARSEC workloads. The instruction footprint of PARSEC is about 128 KB in
our experiments, while that of the Hadoop big data workloads is about 1024 KB. The
main reason is likely that Hadoop has deep software stacks, which make the
instruction footprint larger; Jia [H] also makes the same point in terms of code size.

Figure 7 reports the average data cache miss ratios versus increasing cache size for
the representative big data workloads and PARSEC. From Figure 7, we observe that
the data cache miss ratios of the PARSEC and Hadoop workloads are close once the
cache size exceeds 64 KB. This observation contradicts our intuition that big data
workloads should have a larger data footprint because they process huge amounts of
data.

Figure 8 reports the average cache miss ratios versus increasing cache size for the
representative big data workloads and PARSEC. From Figure 8, we can see that the
cache miss ratios of PARSEC and the Hadoop workloads are close once the cache size
exceeds 1024 KB.
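
One way to read a footprint off a miss ratio curve is to take the smallest cache
size at which the miss ratio has essentially stopped improving. The sketch below
implements that heuristic; the heuristic itself, the 5% tolerance, the function
name, and the curve values are all assumptions for illustration and are not taken
from the measurements reported here.

```python
def estimate_footprint_kb(curve, tolerance=0.05):
    """Smallest cache size (KB) whose miss ratio is within `tolerance`
    (relative) of the miss ratio at the largest simulated size."""
    if not curve:
        raise ValueError("curve must not be empty")
    sizes = sorted(curve)
    floor = curve[sizes[-1]]
    for kb in sizes:
        if curve[kb] <= floor * (1.0 + tolerance):
            return kb
    return sizes[-1]


# Hypothetical miss ratio curve: cache size in KB -> miss ratio.
example_curve = {
    16: 0.080, 32: 0.060, 64: 0.040, 128: 0.020, 256: 0.010,
    512: 0.004, 1024: 0.001, 2048: 0.001, 4096: 0.001, 8192: 0.001,
}

print(estimate_footprint_kb(example_curve))
```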

[Implications]. The footprint is closely related to the cache efficiency of a
workload on the processor, and the footprint results are consistent with the cache
behavior results. Furthermore, in the footprint experiments, we have two
observations: first, for the Hadoop workloads, which have complex software stacks,
the required L1I cache capacity is much larger than that of traditional workloads;
second, the capacity requirements of the L1D cache and of the other cache levels
(such as L2 and L3), which are shared by instructions and data, are not
significantly different between the Hadoop workloads and the traditional workloads.

Figure 6: Instructions Cache miss ratio versus Cache size.

Figure 7: Data Cache miss ratio versus Cache size.

Figure 8: Cache miss ratio versus Cache size.

5.5 Software Stacks Impacts 

In order to investigate the impact of software stacks in depth, we also add six
workloads implemented with MPI (the same workloads as those included in the
representative big data workloads). As shown in Figures 3 to 9, we have the
following observations:

First, for the same algorithms or operations implemented with different software
stacks, the software stack has a serious impact on IPC. As shown in Figure 3, the
IPC of M-WordCount (implemented with MPI) is 1.8 while those of the Hadoop and
Spark implementations are only 1.1 and 0.9, respectively. The average IPC of the MPI
workloads is 1.4 and that of the other workloads is 1.16. The gap is 21%.

Second, for the same algorithms or operations, software stacks have a significant
impact on processor front-end behaviors. As shown in Figure 4, the MPI versions
of the big data workloads have lower L1I MPKI. For example, the L1I MPKI of
M-WordCount is 2 while those of the Hadoop and Spark implementations are 7 and
17, respectively, a difference of up to roughly one order of magnitude. Furthermore,
we calculate the average L1I MPKI of the MPI-version workloads, which is only 3.4,
while that of the Spark or Hadoop versions is 12.6. This implies that complex
software stacks that fail to use state-of-practice processors efficiently are one
of the main factors leading to high front-end stalls. Furthermore, we investigate the
footprint of the MPI big data workloads. Figure 9 shows the average instruction cache
miss ratios versus cache capacity for the MPI-based big data workloads and PARSEC.
From Figure 9, we can see that the instruction cache miss ratios of the MPI-version
big data workloads are close to those of the PARSEC workloads, and lower than those
of the Hadoop workloads. This implies that the instruction footprint of the MPI
workloads is similar to that of the PARSEC workloads.

Third, the software stacks significantly impact not only the L1I behaviors but
also the L2 cache and LLC behaviors. As shown in Figure 4, the L2 and L3 MPKI
of M-WordCount are 0.8 and 0.1, while those of the Hadoop version are 8.4 and 1.9,
respectively, and those of the Spark version are 16 and 2.7, respectively.
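
The size of these gaps can be checked with a few lines of arithmetic. The sketch
below uses only the WordCount numbers and the averages quoted in this subsection;
the dictionary layout and the function names are illustrative choices.

```python
# WordCount metrics per software stack, as quoted in this subsection.
WORDCOUNT = {
    "MPI":    {"ipc": 1.8, "l1i": 2.0,  "l2": 0.8,  "l3": 0.1},
    "Hadoop": {"ipc": 1.1, "l1i": 7.0,  "l2": 8.4,  "l3": 1.9},
    "Spark":  {"ipc": 0.9, "l1i": 17.0, "l2": 16.0, "l3": 2.7},
}


def relative_gap(higher: float, lower: float) -> float:
    """Relative difference of `higher` over `lower`."""
    return (higher - lower) / lower


def mpki_ratios(stack: str, baseline: str = "MPI") -> dict:
    """MPKI of `stack` divided by MPKI of `baseline` for each cache level."""
    base = WORDCOUNT[baseline]
    return {lvl: WORDCOUNT[stack][lvl] / base[lvl] for lvl in ("l1i", "l2", "l3")}


# Average IPC: MPI workloads (1.4) versus the other workloads (1.16).
ipc_gap = relative_gap(1.4, 1.16)
# Average L1I MPKI: Spark or Hadoop versions (12.6) versus MPI versions (3.4).
avg_l1i_ratio = 12.6 / 3.4

print(f"IPC gap: {ipc_gap:.0%}")
print(f"Average L1I MPKI ratio: {avg_l1i_ratio:.1f}x")
for stack in ("Hadoop", "Spark"):
    ratios = mpki_ratios(stack)
    print(stack, {lvl: round(r, 1) for lvl, r in ratios.items()})
```

For WordCount, the Spark-to-MPI ratio is 8.5 for L1I MPKI, while the L2 and L3
ratios of both the Hadoop and Spark versions exceed 10.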

5.6 Summary 

We summarize our observations as follows. 
First, the big data workloads are data movement dominated computing with more
branch operations, which together take up to 92% of the instruction mix.

Second, there are significant disparities of IPC among different big data workloads,
and the average instruction-level parallelism of the big data workloads on the
mainstream processor is not distinctly different from that of other traditional
workloads. We also note that there are significant disparities of front-end
efficiencies among different big data workloads.

Third, complex software stacks that fail to use state-of-practice processors
efficiently are one of the main factors leading to high front-end stalls. For the
same workloads, the L1I cache miss rates differ by up to one order of magnitude
among the diverse implementations using different software stacks.

Finally, we confirmed the observation in [15]: software stacks have a significant
impact on other micro-architectural characteristics, e.g., IPC, L2 cache, and LLC
behaviors. In addition to innovative hardware design, we should pay great attention
to the co-design of software and hardware so as to use state-of-practice processors
efficiently.


6 Related Work 

Big data attracts great attention, drawing many research efforts on big data
benchmarking. Wang et al. [23] develop a comprehensive big data benchmark suite,
BigDataBench, but it consists of too many workloads, resulting in expensive overhead
for conventional simulation-based methods. In comparison, Ferdman et al. [9]
propose CloudSuite, consisting of seven scale-out data center applications, but they
only select a few workloads according to so-called popularity, leading to partial or
biased observations, which is confirmed in Section 5. Xi et al. [25] measure the
microarchitectural characteristics of a search engine. Ren et al. 123 provide
insight into performance and job characteristics by analyzing Hadoop traces derived
from a 2000-node production Hadoop cluster in TaoBao. Smullen et al. [21] develop a
benchmark suite for unstructured data processing, and present four benchmarks which
capture the data access patterns of core operations in a wide spectrum of
unstructured data processing applications. Huang et al. m introduce HiBench, which
evaluates a specific big data platform, Hadoop, with a set of Hadoop programs.
Much research work focuses on workload analysis. Jia et al. [13] characterize
data analysis workloads in data centers. They conclude that many data analysis
applications share inherent characteristics that differ from those of desktop, HPC,
traditional server workloads, and scale-out service workloads. Tang et al. [23| study
the impact of sharing resources on five datacenter applications, including web
search, bigtable, content analyzer, image stitcher and protocol buffer. They find
that resource sharing can bring both a sizable benefit and a potential degradation.
Mishra et al. [IS] propose a task classification methodology, in consideration of
workload dimensions, clustering and break points of qualitative coordinates, and
apply it to the Google Cloud Backend. Jia et al. [15] evaluate the microarchitectural
characteristics of the big data workloads in BigDataBench. They find that software
stacks, e.g., MapReduce vs. Spark, have a significant impact on user-observed
performance and micro-architectural characteristics. Our work confirms this finding
again; in addition, we develop an open source workload characterization tool which
can automatically reduce comprehensive workloads to a subset of representative
workloads.

Several works propose system-independent characterization approaches. Hoste et
al. [in] measure the characteristics of 118 benchmarks by collecting both
microarchitecture-dependent and microarchitecture-independent characteristics.
Eeckhout et al. mm exploit microarchitecture-independent program characteristics to
measure benchmark similarity. Phansalkar et al. [20] subset the SPEC CPU2006
benchmark suite in consideration of both microarchitecture-dependent and
microarchitecture-independent characteristics. We will perform system-independent
characterization of the representative big data workloads in the near future.


7 Conclusion 

In this paper, we choose 45 metrics from the perspective of micro-architectural
characteristics. On the basis of a comprehensive big data benchmark suite,
BigDataBench, we reduce 77 workloads to 17 representative ones.

We compare the representative big data workload subset with SPECINT, SPECFP,
PARSEC, HPCC, CloudSuite, and TPC-C. To investigate the impact of different software
stacks, we also add six workloads implemented with MPI (the same workloads as those
included in the representative big data workloads). We found that the big data
workloads are data movement dominated computing with more branch operations, which
together take up to 92% of the instruction mix. Compared with the traditional
workloads, i.e., PARSEC, the big data workloads have a larger instruction footprint.
Furthermore, there are significant disparities of front-end efficiencies among
different subclasses of big data workloads. Finally, software stacks that fail to
use state-of-practice processors efficiently are one of the main factors leading to
high front-end stalls. In addition to innovative hardware design, we should pay
great attention to the co-design of software and hardware so as to use
state-of-practice processors efficiently.